Mining: Foundations, Techniques and Applications Finite-State Transducers for Semi-Structured Text

Author

  • Chien-Chi Chang
Abstract

Text mining for semi-structured documents requires information extractors. Programming extractors by hand cannot keep up with the volume and variety of documents placed on the World Wide Web every day. This paper presents our recent results on applying machine learning techniques to automate the generation of extractors. Our goal is to develop a domain- and language-independent approach that automatically learns an extractor from training examples of extraction. In particular, this paper discusses the use of finite-state transducers (FSTs) as the representation formalism for the extractors. Previously, we showed that only a small number of examples is enough to learn perfect FST-based extractors for documents from a variety of real Web sites. In this paper, we introduce two classes of FST-based extractors, single-pass and multi-pass, and informally analyze their relative advantages and disadvantages. In terms of sample complexity, we hypothesized that single-pass extractors may have an edge for tabular documents, while multi-pass extractors may require fewer examples for tagged-list documents. We verified this hypothesis empirically and found that, by slightly changing the problem setting, a single-pass learner can outperform a multi-pass learner for tagged-list documents. A prototype system based on this work will be presented in the demo session of this workshop.
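To make the idea of an FST-based extractor concrete, the following is a minimal, hand-written sketch of a single-pass transducer over a toy tagged document. It is only an illustration: the state names, the `<b>`/`<i>` delimiters, the attribute names, and the functions `tokenize` and `extract` are all assumptions made for this example, and the transition table is written by hand rather than learned from training examples as the abstract describes.

```python
import re

# Hypothetical transition table: (current state, delimiter token) -> next state.
# In the setting described in the abstract, such a table would be learned
# from examples; here it is hand-written for illustration only.
TRANSITIONS = {
    ("start", "<b>"): "name",
    ("name", "</b>"): "between",
    ("between", "<i>"): "email",
    ("email", "</i>"): "start",
}

# States whose text is copied to the output record under the state's name.
EMITTING = {"name", "email"}


def tokenize(text):
    """Split the input into tag tokens and text runs, dropping blank runs."""
    return [tok for tok in re.split(r"(<[^>]+>)", text) if tok.strip()]


def extract(text):
    """Scan the tokens once, left to right, and collect one record per row."""
    state = "start"
    record = {}
    records = []
    for tok in tokenize(text):
        next_state = TRANSITIONS.get((state, tok))
        if next_state is not None:
            state = next_state
            # Returning to "start" closes the current record.
            if state == "start" and record:
                records.append(record)
                record = {}
        elif state in EMITTING:
            # Text seen in an emitting state is output as that attribute.
            record[state] = record.get(state, "") + tok.strip()
        # Text seen in a non-emitting state is skipped.
    return records


doc = ("<b>Ann Lee</b> office 12 <i>ann@example.org</i>"
       "<b>Bo Chen</b><i>bo@example.org</i>")
rows = extract(doc)
```

Here the whole document is consumed in one left-to-right pass, which is the sense of "single-pass" assumed in this sketch; a multi-pass design would instead run a separate scan per attribute.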

Similar resources

A Detailed Study on Text Mining Techniques

Text mining is an important step in the knowledge discovery process. It is used to extract hidden information from unstructured or semi-structured data. This aspect is fundamental because most Web information is semi-structured due to the nested structure of HTML code, is linked, and is redundant. Web text mining helps the whole knowledge mining process in mining, extraction and integration of u...

A Study of Text Mining Methods, Applications, and Techniques

Data mining is used to extract useful information from large amounts of data. It is used to implement and solve different types of research problems. The research areas related to data mining are text mining, web mining, image mining, sequential pattern mining, spatial mining, medical mining, multimedia mining, structure mining and graph mining. Text mining also referred to text of data mini...

From Faceted Classification to Knowledge Discovery of Semi-structured Text Records

The maintenance and service records collected and maintained by aerospace companies are a useful resource for in-service engineers in providing ongoing support for their aircraft. Such records are typically semi-structured and contain useful information such as a description of the issue and references to correspondence and documentation generated during its resolution. The inform...

A Web Text Mining Flexible Architecture

Text mining is an important step in the knowledge discovery process. It is used to extract hidden information from unstructured or semi-structured data. This aspect is fundamental because much of the Web information is semi-structured due to the nested structure of HTML code, much of it is linked, and much of it is redundant. Web text mining helps the whole knowledge minin...

Efficient Text and Semi-structured Data Mining: Knowledge Discovery in the Cyberspace

This paper describes applications of the optimized pattern discovery framework to text and Web mining. In particular, we introduce a class of simple combinatorial patterns over texts such as proximity phrase association patterns and ordered and unordered tree patterns modeling unstructured texts and semi-structured data on the Web. Then, we consider the problem of finding the patterns that opti...


Publication date: 1999